Data types

Data type       N of objects   N of timestamps
Cross-section   many           one
Time-series     one            many
Panel data      many           many
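
One way to tell the three types apart in practice is to count distinct objects and distinct timestamps. The following is a minimal sketch, assuming a long-format data frame with one identifier column and one time column. The helper name `classify_data` and the toy data are hypothetical and not part of the original material.

```r
# Hypothetical helper: classify a long-format data set by counting
# distinct objects and distinct timestamps (column names are assumptions).
classify_data <- function(df, id_col, time_col) {
  n_objects <- length(unique(df[[id_col]]))
  n_times   <- length(unique(df[[time_col]]))
  if (n_objects > 1 && n_times > 1) {
    "panel data"
  } else if (n_objects > 1) {
    "cross-section"
  } else if (n_times > 1) {
    "time-series"
  } else {
    "single observation"
  }
}

# Toy data, made up for illustration
cross  <- data.frame(id = 1:4, year = 2020)
series <- data.frame(id = 1, year = 2017:2020)
panel  <- expand.grid(id = 1:4, year = 2017:2020)

classify_data(cross,  "id", "year")
classify_data(series, "id", "year")
classify_data(panel,  "id", "year")
```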

Cross-section

The data give the speed of cars (speed, in mph) and the distances taken to stop (dist, in ft). Note that the data were recorded in the 1920s.

data(cars)
head(cars)
##   speed dist
## 1     4    2
## 2     4   10
## 3     7    4
## 4     7   22
## 5     8   16
## 6     9   10

Time-series

The classic airline data: monthly totals of international airline passengers (in thousands), 1949 to 1960.

data(AirPassengers)
AirPassengers
##      Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec
## 1949 112 118 132 129 121 135 148 148 136 119 104 118
## 1950 115 126 141 135 125 149 170 170 158 133 114 140
## 1951 145 150 178 163 172 178 199 199 184 162 146 166
## 1952 171 180 193 181 183 218 230 242 209 191 172 194
## 1953 196 196 236 235 229 243 264 272 237 211 180 201
## 1954 204 188 235 227 234 264 302 293 259 229 203 229
## 1955 242 233 267 269 270 315 364 347 312 274 237 278
## 1956 284 277 317 313 318 374 413 405 355 306 271 306
## 1957 315 301 356 348 355 422 465 467 404 347 305 336
## 1958 340 318 362 348 363 435 491 505 404 359 310 337
## 1959 360 342 406 396 420 472 548 559 463 407 362 405
## 1960 417 391 419 461 472 535 622 606 508 461 390 432
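
Unlike `cars`, this data set is stored as a time-series (`ts`) object, which carries its own time index. A short sketch of how that index might be inspected with base-R functions, with no outputs asserted beyond what the table above shows:

```r
data(AirPassengers)

class(AirPassengers)      # a "ts" object, not a data frame
start(AirPassengers)      # first period: year 1949, month 1
end(AirPassengers)        # last period: year 1960, month 12
frequency(AirPassengers)  # 12 observations per year (monthly)
length(AirPassengers)     # 12 years x 12 months of observations

# Extract a sub-period with window(); here the first row of the table above
first_year <- window(AirPassengers, start = c(1949, 1), end = c(1949, 12))
first_year
```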

Panel data

A data frame with 3580 observations on 18 variables, including: id (identifier of the panel individual; 716 in total), year (year interviewed: 1982, 1983, 1985, 1987, 1988), lwage (ln(wage/GNP deflator)), hours (usual hours worked), age (age in current year), educ (current grade completed), etc.

df <- read.csv("http://www.principlesofeconometrics.com/poe5/data/csv/nls_panel.csv")
head(df[,1:6], 10)
##    id year    lwage hours age educ
## 1   1   82 1.808289    38  30   12
## 2   1   83 1.863417    38  31   12
## 3   1   85 1.789367    38  33   12
## 4   1   87 1.846530    40  35   12
## 5   1   88 1.856449    40  37   12
## 6   2   82 1.280933    48  36   17
## 7   2   83 1.515855    43  37   17
## 8   2   85 1.930170    35  39   17
## 9   2   87 1.919034    42  41   17
## 10  2   88 2.200974    42  43   17

Source: https://rdrr.io/github/ccolonescu/PoEdata/man/nls_panel.html
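
The dimensions quoted above are consistent with a balanced panel: 716 individuals observed in 5 survey years gives 716 × 5 = 3580 rows. A hedged sketch of how such a check could be done follows. It uses a small made-up data frame (`toy_panel`, a hypothetical name) rather than the downloaded file, so the numbers are illustrative only.

```r
# Toy panel, made up for illustration: 3 individuals, 5 survey years
toy_panel <- expand.grid(year = c(82, 83, 85, 87, 88), id = 1:3)

n_ids   <- length(unique(toy_panel$id))    # number of individuals
n_years <- length(unique(toy_panel$year))  # number of time periods

# Observations per individual; in a balanced panel everyone has the same count
obs_per_id  <- table(toy_panel$id)
is_balanced <- all(obs_per_id == n_years) && nrow(toy_panel) == n_ids * n_years
is_balanced

# The same arithmetic for the dimensions quoted above
716 * 5
```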

Types of Variables

Example mtcars

The data were extracted from the 1974 Motor Trend US magazine and comprise fuel consumption and 10 aspects of automobile design and performance for 32 automobiles (1973–74 models). See ?mtcars for details.

head(mtcars)
##                    mpg cyl disp  hp drat    wt  qsec vs am gear carb
## Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
## Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
## Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
## Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
## Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
## Valiant           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1

A data frame with 32 observations on 11 (numeric) variables.

  • mpg Miles/(US) gallon
  • cyl Number of cylinders
  • disp Displacement (cu.in.)
  • hp Gross horsepower
  • drat Rear axle ratio
  • wt Weight (1000 lbs)
  • qsec 1/4 mile time
  • vs Engine (0 = V-shaped, 1 = straight)
  • am Transmission (0 = automatic, 1 = manual)
  • gear Number of forward gears
  • carb Number of carburetors
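
Although all 11 columns are stored as numeric, several of them (for example vs and am) are categorical in meaning. A small sketch of how they might be converted to labelled factors; the copy name `cars_f` is an assumption introduced here, and the labels are taken from the variable list above.

```r
# All 11 columns are stored as numeric
sapply(mtcars, class)

# Work on a copy so the built-in data set stays untouched
cars_f <- mtcars
cars_f$vs <- factor(cars_f$vs, levels = c(0, 1), labels = c("V-shaped", "straight"))
cars_f$am <- factor(cars_f$am, levels = c(0, 1), labels = c("automatic", "manual"))

# Frequency tables now show readable labels instead of 0/1
table(cars_f$vs)
table(cars_f$am)
```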

summary(mtcars)
##       mpg             cyl             disp             hp       
##  Min.   :10.40   Min.   :4.000   Min.   : 71.1   Min.   : 52.0  
##  1st Qu.:15.43   1st Qu.:4.000   1st Qu.:120.8   1st Qu.: 96.5  
##  Median :19.20   Median :6.000   Median :196.3   Median :123.0  
##  Mean   :20.09   Mean   :6.188   Mean   :230.7   Mean   :146.7  
##  3rd Qu.:22.80   3rd Qu.:8.000   3rd Qu.:326.0   3rd Qu.:180.0  
##  Max.   :33.90   Max.   :8.000   Max.   :472.0   Max.   :335.0  
##       drat             wt             qsec             vs        
##  Min.   :2.760   Min.   :1.513   Min.   :14.50   Min.   :0.0000  
##  1st Qu.:3.080   1st Qu.:2.581   1st Qu.:16.89   1st Qu.:0.0000  
##  Median :3.695   Median :3.325   Median :17.71   Median :0.0000  
##  Mean   :3.597   Mean   :3.217   Mean   :17.85   Mean   :0.4375  
##  3rd Qu.:3.920   3rd Qu.:3.610   3rd Qu.:18.90   3rd Qu.:1.0000  
##  Max.   :4.930   Max.   :5.424   Max.   :22.90   Max.   :1.0000  
##        am              gear            carb      
##  Min.   :0.0000   Min.   :3.000   Min.   :1.000  
##  1st Qu.:0.0000   1st Qu.:3.000   1st Qu.:2.000  
##  Median :0.0000   Median :4.000   Median :2.000  
##  Mean   :0.4062   Mean   :3.688   Mean   :2.812  
##  3rd Qu.:1.0000   3rd Qu.:4.000   3rd Qu.:4.000  
##  Max.   :1.0000   Max.   :5.000   Max.   :8.000

Continuous variable (mpg), find the mean.

dplyr vs. base-R

# base-R 
mean(mtcars$mpg)
## [1] 20.09062
# dplyr (as a part of tidyverse) #1
library(tidyverse)
mtcars %>% select(mpg) %>% summarise(mean = mean(mpg)) %>% pull()
## [1] 20.09062
# dplyr (as a part of tidyverse) #2
summarise(mtcars, mean(mpg)) %>% pull()
## [1] 20.09062
# dplyr + base-R
mean(mtcars %>% select(mpg) %>% pull())
## [1] 20.09062
mtcars %>% select(mpg) %>% pull() %>% mean()
## [1] 20.09062

Binary variable (vs): Engine (0 = V-shaped, 1 = straight), find the mean.

mean(mtcars$vs)
## [1] 0.4375
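
The mean of a 0/1 variable is the share of observations equal to 1, so 0.4375 is the proportion of cars with a straight engine. A brief sketch that checks this by counting; the names `n_straight` and `n_total` are introduced here for illustration.

```r
n_straight <- sum(mtcars$vs == 1)  # cars with a straight engine
n_total    <- nrow(mtcars)         # all cars in the data set

n_straight / n_total               # share of straight engines
mean(mtcars$vs)                    # the same number
```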

Mean mpg by vs type

# base-R (data.frame object)
aggregate(mpg ~ vs, data=mtcars, mean)
##   vs      mpg
## 1  0 16.61667
## 2  1 24.55714
# tidyverse (tibble object)
mtcars %>% group_by(vs) %>% summarise(mean(mpg))
## # A tibble: 2 x 2
##      vs `mean(mpg)`
##   <dbl>       <dbl>
## 1     0        16.6
## 2     1        24.6

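Let's create a manufacturer variable (a candidate factor) from the first word of each car name. This is only a proxy: for some rows the first word is a model name rather than a manufacturer (e.g. Hornet, Valiant).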

# row.names contain:
mtcars %>% row.names() %>% head(3)
## [1] "Mazda RX4"     "Mazda RX4 Wag" "Datsun 710"
# the same with base-R
head(row.names(mtcars), 3)
## [1] "Mazda RX4"     "Mazda RX4 Wag" "Datsun 710"
# split by space (base-R list object)
mtcars %>% row.names() %>% strsplit(split = ' ') %>% head(3)
## [[1]]
## [1] "Mazda" "RX4"  
## 
## [[2]]
## [1] "Mazda" "RX4"   "Wag"  
## 
## [[3]]
## [1] "Datsun" "710"
# extract the first element after split and assign to variable
mtcars %>% row.names() %>% strsplit(split = ' ') %>% map(1) %>% unlist() -> manufacturer
# the same assignment
manufacturer <- mtcars %>% row.names() %>% strsplit(split = ' ') %>% map(1) %>% unlist()
# and display
manufacturer %>% head(3)
## [1] "Mazda"  "Mazda"  "Datsun"
# create the column
mtcars %>% mutate(manufacturer) %>% head()
##    mpg cyl disp  hp drat    wt  qsec vs am gear carb manufacturer
## 1 21.0   6  160 110 3.90 2.620 16.46  0  1    4    4        Mazda
## 2 21.0   6  160 110 3.90 2.875 17.02  0  1    4    4        Mazda
## 3 22.8   4  108  93 3.85 2.320 18.61  1  1    4    1       Datsun
## 4 21.4   6  258 110 3.08 3.215 19.44  1  0    3    1       Hornet
## 5 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2       Hornet
## 6 18.1   6  225 105 2.76 3.460 20.22  1  0    3    1      Valiant
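
The same first-word extraction can be done without purrr. A base-R sketch, assuming the first word always ends at the first space; the names `manufacturer_base` and `cars_m` are introduced here for illustration. It also stores the result as a factor and keeps the row names, which `mutate()` drops in the output above.

```r
# Drop everything from the first space onward; names without a space are kept as is
manufacturer_base <- sub(" .*", "", rownames(mtcars))
head(manufacturer_base, 3)

# Store it as a factor in a copy of the data (row names are preserved here)
cars_m <- mtcars
cars_m$manufacturer <- factor(manufacturer_base)
head(cars_m[, c("mpg", "manufacturer")], 3)
```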

Find mean mpg by manufacturer.

mtcars %>% mutate(manufacturer) %>%
  group_by(manufacturer) %>% summarise(mean1 = mean(mpg)) %>%
  arrange(desc(mean1)) %>% head()
## # A tibble: 6 x 2
##   manufacturer mean1
##   <chr>        <dbl>
## 1 Honda         30.4
## 2 Lotus         30.4
## 3 Fiat          29.8
## 4 Toyota        27.7
## 5 Porsche       26  
## 6 Datsun        22.8
# group_by two factors
mtcars %>% mutate(manufacturer) %>%
  group_by(manufacturer, am) %>% summarise(mean1 = mean(mpg)) %>%
  arrange(desc(manufacturer)) %>% head()
## # A tibble: 6 x 3
## # Groups:   manufacturer [5]
##   manufacturer    am mean1
##   <chr>        <dbl> <dbl>
## 1 Volvo            1  21.4
## 2 Valiant          0  18.1
## 3 Toyota           0  21.5
## 4 Toyota           1  33.9
## 5 Porsche          1  26  
## 6 Pontiac          0  19.2

Statistical inference

Correlation

Correlation (linear relationship)

Sources:

  • https://benwhalley.github.io/rmip/regression.html
  • https://scpoecon.github.io/ScPoEconometrics/linreg.html#correlation-covariance-and-linearity
  • https://www.econometrics-with-r.org/3.7-scatterplots-sample-covariance-and-sample-correlation.html

Correlation (linear)

Example mtcars

# Correlation between mpg (Miles/gallon) and wt (Weight)
cor(mtcars$mpg, mtcars$wt)
## [1] -0.8676594
plot(mtcars$wt ~ mtcars$mpg,
     xlab='Miles/(US) gallon',
     ylab='Weight (1000 lbs)')
abline(lm(mtcars$wt ~ mtcars$mpg), col='blue')
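
The Pearson correlation reported by `cor()` is the sample covariance scaled by both standard deviations. A minimal sketch that reproduces it by hand; the variable names `x`, `y` and `r_manual` are introduced here for illustration.

```r
x <- mtcars$mpg
y <- mtcars$wt

# Pearson correlation = covariance divided by the product of standard deviations
r_manual <- cov(x, y) / (sd(x) * sd(y))

r_manual
cor(x, y)   # built-in function, same value
cor(y, x)   # the coefficient is symmetric in its arguments
```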

Test for significance (the null hypothesis is that the true correlation is zero, i.e. there is no linear relationship)

cor.test(mtcars$mpg, mtcars$wt)
## 
##  Pearson's product-moment correlation
## 
## data:  mtcars$mpg and mtcars$wt
## t = -9.559, df = 30, p-value = 1.294e-10
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.9338264 -0.7440872
## sample estimates:
##        cor 
## -0.8676594
# extract the p-value and compare it with the 5% significance level
cor.test(mtcars$mpg, mtcars$wt)$p.value < 0.05
## [1] TRUE
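
The t statistic and p-value in the output above can be recovered from the correlation coefficient and the sample size alone. A sketch of that calculation, using the standard formula for testing a zero Pearson correlation; the names `t_stat` and `p_value` are introduced here for illustration.

```r
r <- cor(mtcars$mpg, mtcars$wt)
n <- nrow(mtcars)

# Test statistic under H0: true correlation = 0
t_stat  <- r * sqrt(n - 2) / sqrt(1 - r^2)
# Two-sided p-value from the t distribution with n - 2 degrees of freedom
p_value <- 2 * pt(-abs(t_stat), df = n - 2)

t_stat
p_value
```

Both values should agree with the `t` and `p-value` entries printed by `cor.test()`.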

Test for significance: a second example, where the null hypothesis of no linear relationship is not rejected

# Rear axle ratio (drat) vs. 1/4 mile time (qsec)
cor.test(mtcars$drat, mtcars$qsec)
## 
##  Pearson's product-moment correlation
## 
## data:  mtcars$drat and mtcars$qsec
## t = 0.50164, df = 30, p-value = 0.6196
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.265947  0.426340
## sample estimates:
##        cor 
## 0.09120476
# extract the p-value and compare it with the 5% significance level
cor.test(mtcars$drat, mtcars$qsec)$p.value < 0.05
## [1] FALSE
plot(mtcars$qsec ~ mtcars$drat)
abline(lm(mtcars$qsec ~ mtcars$drat), col='blue')

Regression

Correlation provides a measure of the linear association between pairs of variables, but it doesn’t tell us about more complex relationships.

You can use regression to develop a more formal understanding of relationships between variables. In regression, and in statistical modeling in general, we want to model the relationship between an output variable (the response, or dependent variable) and one or more input variables (factors, or independent variables).

Source: https://www.jmp.com/en_ch/statistics-knowledge-portal/what-is-regression.html

OLS Regression

Simple regression (paired regression)

  • The regression model (e.g. a line) is fitted with the Ordinary Least Squares (OLS) estimator
  • The relationship between \(y\) and \(x\) can be expressed as a linear equation with coefficients \[\hat{y}_i=b_0+b_1\cdot x_i\]
  • In simple terms, "the distances" from the points to the line are minimized

Source: https://scpoecon.github.io/ScPoEconometrics/linreg.html#correlation-covariance-and-linearity

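Strictly speaking, OLS minimizes the sum of squared vertical distances (residuals) between the points and the line, \(\sum_i (y_i-b_0-b_1\cdot x_i)^2\); the fitted line is the one that attains this minimum.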

Source: https://scpoecon.github.io/ScPoEconometrics/linreg.html#correlation-covariance-and-linearity
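
For a single regressor the minimization has a well-known closed-form solution: the slope is the covariance of x and y divided by the variance of x, and the intercept makes the line pass through the point of means. A minimal sketch using mtcars; the helper `ssr()` and the choice of wt as dependent and mpg as independent variable are assumptions made for illustration.

```r
x <- mtcars$mpg  # independent variable
y <- mtcars$wt   # dependent variable

# Closed-form OLS solution for one regressor
b1 <- cov(x, y) / var(x)
b0 <- mean(y) - b1 * mean(x)

# Sum of squared residuals for any candidate line
ssr <- function(intercept, slope) sum((y - intercept - slope * x)^2)

c(b0 = b0, b1 = b1)
coef(lm(y ~ x))      # lm() returns the same coefficients

ssr(b0, b1)          # value at the OLS line
ssr(b0 + 0.1, b1)    # shifting the line increases the sum
ssr(b0, b1 + 0.01)   # so does tilting it
```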

Interactive playground at the source below (one click adds a point, double-click removes all points)

Source: https://www.econometrics-with-r.org/4.2-estimating-the-coefficients-of-the-linear-regression-model.html

Example mtcars

Estimate the regression coefficients and test the null hypothesis \(H_0:b_j=0\) (no relationship) for each of them

# wt (Weight) is dependent variable
# mpg (Miles/Gallon) is independent variable
lm(wt ~ mpg, data=mtcars) %>% summary()
## 
## Call:
## lm(formula = wt ~ mpg, data = mtcars)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -0.6516 -0.3490 -0.1381  0.3190  1.3684 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  6.04726    0.30869  19.590  < 2e-16 ***
## mpg         -0.14086    0.01474  -9.559 1.29e-10 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.4945 on 30 degrees of freedom
## Multiple R-squared:  0.7528, Adjusted R-squared:  0.7446 
## F-statistic: 91.38 on 1 and 30 DF,  p-value: 1.294e-10
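
The pieces of this summary can be extracted programmatically. In a simple regression they tie back to the correlation analysis: the slope's t statistic and p-value match those of `cor.test()`, and R-squared equals the squared correlation. A short sketch; the object names `fit` and `s` are introduced here for illustration.

```r
fit <- lm(wt ~ mpg, data = mtcars)
s   <- summary(fit)

coef(s)                       # coefficient table as a numeric matrix
coef(s)["mpg", "Pr(>|t|)"]    # p-value of the slope
s$r.squared                   # R-squared of the regression
cor(mtcars$wt, mtcars$mpg)^2  # squared correlation, same value in simple regression
confint(fit, level = 0.95)    # 95% confidence intervals for both coefficients
```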

Common presentation of results in articles

library(stargazer)
model1 <- lm(wt ~ mpg, data=mtcars)
stargazer(model1, type = 'text')
## 
## ===============================================
##                         Dependent variable:    
##                     ---------------------------
##                                 wt             
## -----------------------------------------------
## mpg                          -0.141***         
##                               (0.015)          
##                                                
## Constant                     6.047***          
##                               (0.309)          
##                                                
## -----------------------------------------------
## Observations                    32             
## R2                             0.753           
## Adjusted R2                    0.745           
## Residual Std. Error       0.494 (df = 30)      
## F Statistic           91.375*** (df = 1; 30)   
## ===============================================
## Note:               *p<0.1; **p<0.05; ***p<0.01

More info: https://bookdown.org/yihui/rmarkdown-cookbook/kable.html

Multiple regression

  • The relationship between one \(y\) (dependent variable) and several \(x\)s (independent variables) can be expressed as a linear equation with coefficients \[\hat{y}_i=b_0+b_1\cdot x_{1i}+b_2\cdot x_{2i}+b_3\cdot x_{3i}\]
  • Playground example
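
With several regressors, the OLS coefficients solve the normal equations \((X'X)b = X'y\), where \(X\) is the design matrix with a column of ones for the intercept. A minimal sketch using mtcars; the choice of wt as dependent variable and mpg, hp, cyl as regressors is an assumption made for illustration. `lm()` itself relies on a QR decomposition rather than this direct solve, so this is a demonstration of the algebra, not of its internals.

```r
# Design matrix with an intercept column, and the response vector
X <- model.matrix(~ mpg + hp + cyl, data = mtcars)
y <- mtcars$wt

# Solve the normal equations (X'X) b = X'y
b <- drop(solve(crossprod(X), crossprod(X, y)))

b
coef(lm(wt ~ mpg + hp + cyl, data = mtcars))  # same estimates from lm()
```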

Estimates from the previous example

## 
## Call:
## lm(formula = y ~ x1 + x2)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -48.845 -10.240  -0.308   9.815  43.461 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 686.03225    7.41131  92.566  < 2e-16 ***
## x1           -1.10130    0.38028  -2.896  0.00398 ** 
## x2           -0.64978    0.03934 -16.516  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 14.46 on 417 degrees of freedom
## Multiple R-squared:  0.4264, Adjusted R-squared:  0.4237 
## F-statistic:   155 on 2 and 417 DF,  p-value: < 2.2e-16

model2 <- lm(wt ~ mpg + hp + cyl, data=mtcars)
stargazer(model1, model2, type = 'text')
## 
## =================================================================
##                                  Dependent variable:             
##                     ---------------------------------------------
##                                          wt                      
##                              (1)                    (2)          
## -----------------------------------------------------------------
## mpg                       -0.141***              -0.125***       
##                            (0.015)                (0.029)        
##                                                                  
## hp                                                 -0.002        
##                                                   (0.002)        
##                                                                  
## cyl                                                0.135         
##                                                   (0.112)        
##                                                                  
## Constant                   6.047***               5.186***       
##                            (0.309)                (1.130)        
##                                                                  
## -----------------------------------------------------------------
## Observations                  32                     32          
## R2                          0.753                  0.766         
## Adjusted R2                 0.745                  0.740         
## Residual Std. Error    0.494 (df = 30)        0.498 (df = 28)    
## F Statistic         91.375*** (df = 1; 30) 30.481*** (df = 3; 28)
## =================================================================
## Note:                                 *p<0.1; **p<0.05; ***p<0.01

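Note: the measure of fit \(R^2\) (the coefficient of determination) is the fraction of the sample variance of the dependent variable that is explained by the regressors. In the table above, adding hp and cyl raises \(R^2\) from 0.753 to 0.766 but lowers adjusted \(R^2\) from 0.745 to 0.740, because the adjusted measure penalizes additional regressors.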
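
Both fit measures can be computed by hand from the residuals. A sketch for model (2), using the usual definitions \(R^2 = 1 - SSR/SST\) and adjusted \(R^2 = 1 - (1-R^2)(n-1)/(n-k-1)\) with \(k\) regressors; the object names are introduced here for illustration.

```r
fit <- lm(wt ~ mpg + hp + cyl, data = mtcars)
y   <- mtcars$wt
n   <- length(y)
k   <- length(coef(fit)) - 1   # number of regressors, excluding the intercept

ssr <- sum(residuals(fit)^2)   # unexplained (residual) variation
sst <- sum((y - mean(y))^2)    # total variation around the mean

r2     <- 1 - ssr / sst
adj_r2 <- 1 - (1 - r2) * (n - 1) / (n - k - 1)

round(c(R2 = r2, Adjusted_R2 = adj_r2), 3)
```

The rounded values should match the R2 and Adjusted R2 rows of column (2) in the table above.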

Common issues in usage

Pay attention to the following:

Log transformation example

Source: https://scpoecon.github.io/ScPoEconometrics/linreg.html

Interpretation of common regression specifications

Source: https://scpoecon.github.io/ScPoEconometrics/linreg.html
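
The four textbook specifications differ only in which side of the equation is log-transformed, and that changes how the slope is read. A hedged sketch using mtcars; regressing mpg on wt is an illustrative choice made here, not an example from the sources above, and the percentage readings in the comments are the usual approximations for small changes.

```r
# Four common specifications (variable choice is illustrative only)
level_level <- lm(mpg ~ wt,           data = mtcars)  # +1 unit of wt -> b1 units of mpg
log_level   <- lm(log(mpg) ~ wt,      data = mtcars)  # +1 unit of wt -> approx. 100*b1 % of mpg
level_log   <- lm(mpg ~ log(wt),      data = mtcars)  # +1 % of wt    -> approx. b1/100 units of mpg
log_log     <- lm(log(mpg) ~ log(wt), data = mtcars)  # +1 % of wt    -> approx. b1 % of mpg (elasticity)

# Collect the slope from each model for comparison
slopes <- sapply(
  list(level_level = level_level, log_level = log_level,
       level_log = level_log, log_log = log_log),
  function(m) unname(coef(m)[2])
)
slopes
```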

Seminar Preparation